01 Data Cleansing | ✅ Author: 이유정 (Ph.D.)

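🧼 What is data cleansing? It is the work of turning dirty data into clean data. The data we want to analyze on a computer is rarely perfect.
Data collected in the real world can contain missing values, duplicated values, wrong formats, and odd ways of writing things. Data cleansing is the process of finding, fixing, and tidying up those problems.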

Suppose we have a table of customer information like this:

Name	Email	Age	Join date
김민지	minji@email.com	25	2023-01-01
이철수		33	2023/02/05
김민지	minji@email.com	25	2023-01-01
박영희	younghee@email	스물다섯	2023-03-10

Common data problems

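Missing values → some cells are empty.
Example: 이철수's email is empty
Duplicated data → the same record appears more than once.
Example: 김민지's record appears twice
Wrong format → dates, numbers, and similar fields are written in different formats.
Example: one join date is 2023-01-01, another is 2023/02/05
Inconsistent data → values are typed in by people, so the same thing gets written in different ways.
Example: an age is written as '스물다섯' (Korean for "twenty-five") instead of a number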
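
The four problems above can each be detected with a line or two of pandas. The following is a minimal sketch, assuming the sample table has been typed into a DataFrame by hand; the English column names (name, email, age, join_date) and the variable names are chosen here for illustration only.

```python
import pandas as pd

# The sample customer table, entered by hand (column names are assumptions).
df = pd.DataFrame({
    "name": ["김민지", "이철수", "김민지", "박영희"],
    "email": ["minji@email.com", None, "minji@email.com", "younghee@email"],
    "age": ["25", "33", "25", "스물다섯"],
    "join_date": ["2023-01-01", "2023/02/05", "2023-01-01", "2023-03-10"],
})

# Missing values: count the empty cells in each column.
missing_per_column = df.isnull().sum()

# Duplicated data: rows that repeat an earlier row exactly.
duplicate_rows = df[df.duplicated()]

# Inconsistent data: ages that cannot be read as numbers.
# errors="coerce" turns unparseable entries into NaN instead of raising.
non_numeric_age = df[pd.to_numeric(df["age"], errors="coerce").isnull()]

# Wrong format: dates that do not follow the YYYY-MM-DD pattern.
bad_date_format = df[~df["join_date"].str.match(r"^\d{4}-\d{2}-\d{2}$")]

print(missing_per_column)
print(duplicate_rows)
print(non_numeric_age)
print(bad_date_format)
```

Each check returns the offending rows rather than changing them, so you can look at what is wrong before deciding how to fix it.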

So how do we fix them? (with pandas)

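isnull(), dropna() → find and remove missing values
duplicated() → find duplicated rows
drop_duplicates() → remove duplicates
str.replace() → fix strings (e.g. "스물다섯" → "25"; the result is still text, so convert it to a number with pd.to_numeric())
pd.to_datetime() → unify date formats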
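
Here is how the functions listed above might be applied to the sample table. This is a sketch under assumptions: the column names are made up for illustration, and the "/" separators are normalized to "-" before parsing so that one explicit date format covers every row.

```python
import pandas as pd

# The sample customer table, entered by hand (column names are assumptions).
df = pd.DataFrame({
    "name": ["김민지", "이철수", "김민지", "박영희"],
    "email": ["minji@email.com", None, "minji@email.com", "younghee@email"],
    "age": ["25", "33", "25", "스물다섯"],
    "join_date": ["2023-01-01", "2023/02/05", "2023-01-01", "2023-03-10"],
})

# isnull(): count missing values per column.
missing_counts = df.isnull().sum()

# duplicated(): True for every row that repeats an earlier one.
dup_mask = df.duplicated()

# drop_duplicates(): keep the first copy of each row.
deduped = df.drop_duplicates().copy()

# str.replace() only rewrites text; pd.to_numeric() makes the column numeric.
age_text = deduped["age"].str.replace("스물다섯", "25", regex=False)
deduped["age"] = pd.to_numeric(age_text)

# Normalize the separator, then parse every date with one explicit format.
date_text = deduped["join_date"].str.replace("/", "-", regex=False)
deduped["join_date"] = pd.to_datetime(date_text, format="%Y-%m-%d")

print(missing_counts)
print(deduped)
```

The missing email is left alone here on purpose: whether to drop the row with dropna() or fill it in depends on how the data will be used.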

Why does data cleansing matter?

It is essential for accurate analysis.
If bad data goes in, wrong results can come out.
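
A small worked example shows how one duplicated row is enough to shift a result. It reuses the ages from the sample table and assumes 박영희's age has already been corrected to 25; the variable names are illustrative.

```python
import pandas as pd

# Ages from the sample table; 김민지 appears twice because of the duplicate row.
df = pd.DataFrame({
    "name": ["김민지", "이철수", "김민지", "박영희"],
    "age": [25, 33, 25, 25],
})

# Statistics computed on the raw table count 김민지 twice.
raw_count = len(df)
raw_mean = df["age"].mean()

# The same statistics after removing the duplicate row.
clean = df.drop_duplicates()
clean_count = len(clean)
clean_mean = clean["age"].mean()

print(f"raw:   {raw_count} customers, mean age {raw_mean:.2f}")
print(f"clean: {clean_count} customers, mean age {clean_mean:.2f}")
```

Both the customer count and the average age change, even though only a single row was duplicated.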

Data cleansing example code

import pandas as pd

# 1. Load the data
df = pd.read_csv("csv_files/combined_customers.csv")

# 2. Check for missing values
print("Missing value check:")
print(df.isnull().sum())
print()

# 3. Check the share of missing values in age and email
total = len(df)
print(f"Total rows: {total}")
print("age missing rate:", df['age'].isnull().sum() / total)
print("email missing rate:", df['email'].isnull().sum() / total)
print()

# 4. Drop rows where both email and age are missing
#    .copy() makes df_cleaned an independent frame, so the assignments
#    below do not trigger SettingWithCopyWarning
df_cleaned = df[~(df['age'].isnull() & df['email'].isnull())].copy()
print(f"After dropping rows missing both email and age: {len(df_cleaned)} rows")

# 5. Fill missing ages with the mean age (rounded)
mean_age = df_cleaned['age'].mean()
df_cleaned['age'] = df_cleaned['age'].fillna(round(mean_age))

# 6. Fill missing emails with the placeholder "unknown@example.com"
df_cleaned['email'] = df_cleaned['email'].fillna("unknown@example.com")

# 7. Remove duplicates (by customer_id)
df_cleaned = df_cleaned.drop_duplicates(subset='customer_id')

# 8. Check the result
print("\nData after cleansing:")
print(df_cleaned.head())

# 9. Save (optional)
df_cleaned.to_csv("csv_files/combined_customers_cleaned.csv", index=False)

Result:

Missing value check:
customer_id     0
name            0
age            22
email          22
join_date       0
dtype: int64

Total rows: 500
age missing rate: 0.044
email missing rate: 0.044

After dropping rows missing both email and age: 488 rows

Data after cleansing:
   customer_id       name   age                  email  \
0            1  Customer1  46.0  customer1@example.com   
1            2  Customer2  46.0  customer2@example.com   
2            3  Customer3  46.0  customer3@example.com   
3            4  Customer4  46.0  customer4@example.com   
4            5  Customer5  46.0  customer5@example.com   

                    join_date  
0  2023-01-27 06:52:22.275679  
1  2023-07-15 06:52:22.275698  
2  2023-01-24 06:52:22.275704  
3  2023-03-25 06:52:22.275709  
4  2023-03-25 06:52:22.275713
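
To try the same steps without the CSV file, they can be wrapped in a function and run on a small in-memory table. This is only a sketch: the function name clean_customers and the five sample rows are made up for illustration, and only the column names come from the example above.

```python
import pandas as pd


def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleansing steps from the example to a customer table."""
    # Drop rows where both age and email are missing; copy so that the
    # assignments below modify an independent frame.
    cleaned = df[~(df["age"].isnull() & df["email"].isnull())].copy()
    # Fill missing ages with the rounded mean of the remaining ages.
    cleaned["age"] = cleaned["age"].fillna(round(cleaned["age"].mean()))
    # Fill missing emails with a placeholder address.
    cleaned["email"] = cleaned["email"].fillna("unknown@example.com")
    # Keep the first row for each customer_id.
    return cleaned.drop_duplicates(subset="customer_id")


# Hypothetical sample: customer 3 is duplicated, customer 4 has no age or email.
sample = pd.DataFrame({
    "customer_id": [1, 2, 3, 3, 4],
    "name": ["A", "B", "C", "C", "D"],
    "age": [30.0, None, 40.0, 40.0, None],
    "email": ["a@example.com", "b@example.com", None, None, None],
})

result = clean_customers(sample)
print(result)
```

As in the example script, duplicates are removed last, so a duplicated row still counts toward the mean used to fill missing ages. Removing duplicates first is a reasonable alternative if that weighting is not wanted.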
